Intelligent architecture creator

ABSTRACT

Systems and methods are disclosed to automatically generate a processor architecture for a custom integrated circuit (IC) described by a computer readable code. The IC has one or more timing and hardware constraints. The system extracts parameters defining the processor architecture from a static profile and a dynamic profile of the computer readable code; iteratively optimizes the processor architecture by changing one or more parameters until all timing and hardware constraints expressed as a cost function are met; and synthesizes the generated processor architecture into a computer readable description of the custom integrated circuit for semiconductor fabrication.

This application is a continuation of application Ser. No. 12/906,857 filed 18-OCT-2010, the content of which is incorporated by reference.

The present invention relates to a method for automatically generating an optimal architecture for a custom integrated circuit (IC) or an application-specific integrated circuit (ASIC).

BACKGROUND

Modern electronic appliances and industrial products rely on electronic devices such as standard and custom integrated circuits (ICs). An IC designed and manufactured for specific purposes is called an ASIC. The number of functions, which translates to transistors, included in each of those ICs has been rapidly growing year after year due to advances in semiconductor technology.

Normally the chip design process begins when algorithm designers specify all the functionality that the chip must perform. This is usually done in a language like C or Matlab. A team of chip specialists, tools engineers, verification engineers and firmware engineers then work many man-years to map the algorithm to a hardware chip and associated firmware. The team can use an off-the-shelf processor, which is proven but may have performance limitations because the standard architecture may not fit well with the algorithm.

The alternative is to design a custom architecture and custom hardware to achieve high performance for the desired algorithm. A computer architecture is a detailed specification of the computational, communication, and data storage elements (hardware) of a computer system, how those components interact (machine organization), and how they are controlled (instruction set). A machine's architecture determines which computations can be performed most efficiently, and which forms of data organization and program design will perform optimally.

The custom chip approach is a very expensive process and also fraught with risks from cost-overruns to technical problems. Developing cutting-edge custom IC designs introduces many issues that need to be resolved. Higher processing speeds have introduced conditions into the analog domain that were formerly purely digital in nature, such as multiple clock regions, increasingly complex clock multiplication and synchronization techniques, noise control, and high-speed I/O.

Another effect of increased design complexity is the additional number of production turns that may be needed to achieve a successful design. Yet another issue is the availability of skilled workers. The rapid growth in ASIC circuit design has coincided with a shortage of skilled IC engineers.

SUMMARY

In one aspect, systems and methods are disclosed to automatically generate a custom integrated circuit (IC) described by a computer readable code or model. The IC has one or more timing and hardware constraints. The system extracts parameters defining the processor architecture from a static profile and a dynamic profile of the computer readable code; iteratively optimizes the processor architecture by changing one or more parameters until all timing and hardware constraints expressed as a cost function are met; and synthesizes the generated processor architecture into a computer readable description of the custom integrated circuit for semiconductor fabrication.

Implementations of the above aspects may include one or more of the following. The system can optimize processor scalarity and instruction grouping rules. The system can also optimize the number of cores needed and automatically split the instruction stream to use the cores effectively. The processor architecture optimization includes changing an instruction set. Changing an instruction set includes reducing the number of instructions required and encoding the instructions to improve instruction access, decode speed and instruction memory size requirements. The processor architecture optimization includes changing one of: a register file port, port width, and number of ports to data memory. The processor architecture optimization includes changing one of: data memory size, data cache pre-fetch policy, data cache policy, instruction memory size, instruction cache pre-fetch policy and instruction cache policy. The processor architecture optimization includes adding a co-processor. The system can automatically generate a new instruction uniquely customized to the computer readable code to improve performance of the processor architecture. The system includes pre-processing the computer readable code by determining a memory location for each pointer variable; and inserting an instrumentation for each line. The system includes parsing the computer readable code, and further includes removing dummy assignments; removing redundant loop operations; identifying required memory bandwidth; replacing one or more software implemented flags with one or more hardware flags; and reusing expired variables. The extracting parameters further includes determining an execution cycle time for each line; determining an execution clock cycle count for each line; determining clock cycle count for one or more bins; generating an operator statistic table; generating statistics for each function; and sorting lines by descending order of execution count.
The system can mold commonly used instructions into one or more groups and generate a custom instruction for each group to improve performance (instruction molding). The system includes checking for a molding violation in the new instruction candidate. A cost function can be used to select an instruction molding candidate (IMC). IMCs can be based on statistical dependence. The system can determine timing and area costs for the architecture parameter change. Sequences in the program that could be replaced with the IMCs are identified. This includes the ability to rearrange instructions within a sequence to maximize the fit without compromising the functionality of the code. The system can track pointer marching and build statistics regarding stride and memory access patterns and memory dependency to optimize cache pre-fetching and a cache policy.

The system also includes performing static profiling of the computer readable code and/or dynamic profiling of the computer readable code. A system chip specification is designed based on the profiles of the computer readable code. The chip specification can be further optimized incrementally based on static and dynamic profiling of the computer readable code. The computer readable code can be compiled into optimal assembly code, which is linked to generate firmware for the selected architecture. A simulator can perform cycle accurate simulation of the firmware. The system can perform dynamic profiling of the firmware. The method includes optimizing the chip specification further based on profiled firmware or based on the assembly code. The system can automatically generate register transfer level (RTL) code for the designed chip specification. The system can also perform synthesis of the RTL code to fabricate silicon.

Advantages of the preferred embodiments may include one or more of the following. The system automates the evaluation process so that all costs are taken into consideration and the system designer gets the best possible number representation and bit width candidates to evaluate. The method can evaluate the area, timing and power cost of a given architecture in a quick and automated fashion. This methodology is used as a cost computing engine. The method enables the synthesis of the DSP automatically based on the algorithm in an optimal fashion. The system designer does not need to be aware of the hardware area, delay and power cost associated with the choice of a particular representation over another one. The system allows hardware area, delay and power to be modeled as accurately as possible at the algorithm evaluation stage.

Other advantages of the preferred embodiments of the system may include one or more of the following. The system alleviates the problems of chip design and makes it a simple process. The embodiments shift the focus of the product development process from the hardware implementation process back to product specification and computer readable code or algorithm design. Instead of being tied down to specific hardware choices, the computer readable code or algorithm can be implemented on a processor that is optimized specifically for that application. The preferred embodiment generates an optimized processor automatically along with all the associated software tools and firmware applications. This process can be done in a matter of days instead of years as is conventional. The described automatic system removes the risk and makes chip design an automatic process so that the algorithm designers themselves can directly make the hardware chip without any chip design knowledge since the primary input to the system is the computer readable code, model or algorithm specification rather than low level primitives.

Yet other benefits of using the system may include:

1) Speed: If chip design cycles become measured in weeks instead of years, the companies using the system can penetrate rapidly changing markets by bringing their products quickly to the market.

2) Cost: The numerous engineers that are usually needed to be employed to implement chips are made redundant. This brings about tremendous cost savings to the companies using the instant system.

3) Optimality: The chips designed using the instant system have superior performance, area and power consumption.

The instant system is a complete shift in paradigm in the methodology used in the design of systems that have a digital chip component. The system is a completely automated software product that generates digital hardware from algorithms described in C/Matlab. The system uses a unique approach to the process of taking a high level language such as C or Matlab to a realizable hardware chip. In a nutshell, it makes chip design a completely automated software process.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an exemplary system to automatically generate an architecture for a custom IC or ASIC device whose functionality is specified by a program, code or computer model.

FIG. 2 shows in more detail an exemplary preprocessor used in FIG. 1.

FIG. 3 shows in more detail an exemplary parser used in FIG. 1.

FIG. 4 shows in more detail an exemplary parameter extraction module in FIG. 1.

FIG. 5 shows an exemplary process to iteratively generate an optimal architecture for a custom hardware solution from a computer program.

FIG. 6 shows an exemplary system to automatically generate a custom IC with the architecture defined in FIG. 5.

DESCRIPTION

FIG. 1 shows an exemplary system to automatically determine the best architecture for a custom IC or ASIC device whose functionality is specified by a program, code or computer model. The figure shows the different stages involved in obtaining an architecture definition for a given computer readable code or program (1) provided as input. In one embodiment, the program is written in the C-language, but other languages such as Matlab, Python, or Java can be used as well. In a pre-processor (2), the input program (1) is formatted before such program is analyzed by a parser (3). In the formatting process, functionality of the program (1) is preserved. The parser (3) initially uses a basic architecture to extract all the information and creates a database. The system then collects static and dynamic profiles of the program. From the output of the parser (3), parameters required for the architecture definition (5) are extracted by a parameter extractor (4). With these parameters and the real time and hardware constraints to be met by the program (1) as inputs, the system iteratively determines the most suitable architecture at the given stage for the given C-program. The architecture is then used to parse the C-program and extract parameters again, and a new architecture is defined. This loop continues until the best architecture, which gives the best time, area and power performance, is defined.

FIG. 2 shows in more detail an exemplary preprocessor used in FIG. 1. The preprocessor (2) receives the program (1) and converts the program into code with only one operator per line (10). Loops in the program also are converted into the form “if . . . goto . . . else goto . . . ” (11). Next, the system replaces occurrences of the directive “#define variable” with their respective constant values (12). The system determines the memory location for each pointer variable (13), and inserts instrumentation for each line (14). Next, the operations of FIG. 2 are illustrated on exemplary C-programs.

In (10), each line of the C-program that contains multiple operators is formatted to have one operator per line. Thus, the exemplary code

   int a,b,c,d;
   d = d + (a *b) /c;

is changed to

   int D1182;
   int D1183;
   int a;
   int b;
   int c;
   int d;
   D1182 = a * b;
   D1183 = D1182 / c;
   d = D1183 + d;
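By way of illustration only, the one-operator-per-line conversion of (10) can be sketched as follows. Python is used purely for convenience, and the function name `flatten` and the temporary names `T0`, `T1` are hypothetical. The sketch assumes the right hand side uses only `+`, `-`, `*` and `/` on plain variables and constants, so that Python's own `ast` parser can stand in for a C expression parser. Unlike the listing above, it keeps operands in source order.

```python
import ast

# Operators handled by this sketch; anything else is rejected.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def flatten(target, expr):
    """Split 'target = expr' into statements with one operator each."""
    out = []
    counter = 0

    def visit(node, dest=None):
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if not isinstance(node, ast.BinOp) or type(node.op) not in OPS:
            raise ValueError("unsupported expression")
        left = visit(node.left)
        right = visit(node.right)
        if dest is None:
            # Intermediate result: allocate a fresh temporary.
            dest = f"T{counter}"
            counter += 1
        out.append(f"{dest} = {left} {OPS[type(node.op)]} {right};")
        return dest

    visit(ast.parse(expr, mode="eval").body, target)
    return out

for statement in flatten("d", "d + (a * b) / c"):
    print(statement)
```

Temporaries are emitted innermost first, so every statement reads only values that were computed on earlier lines.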

In (11), the “if goto else goto” conversion converts the exemplary code

   int i,a;
   for(i=1;i<10;i++)
   a+=10;

to

   int i;
   int a;
   i = 1 ;
   goto D1181 ;
   D1180: ;
   a = a + 10 ;
   i = i + 1 ;
   D1181: ;
   if (i <= 9)
   {
      goto D1180 ;
   }
   else
   {
      goto D1182 ;
   }
   D1182: ;

In the replacement of “#define” variables with respective constant values (12), the exemplary code

   #define data 10
   main( )
   {
   int i,a;
   if(i<data)
   i=i+data;
   else
   i=0;
   }

is converted to

   main ( )
   {
   int i;
   int a;
   if (i <= 9)
      {
      i = i + 10;
      }
   else
      {
      i = 0;
      }
   }

In (13), the exact memory location for each pointer variable depending upon its data type is calculated. Thus, the following exemplary code

   int *a;
   char *d;
   *a=10;
   a++;
   d=a;
   *d='c';
   d++;
   *d='b';

is converted to

   int * a;
   char * d;
   *a = 10;
   a = a + 4;
   d = (char *) a;
   *d = 99;
   d = d + 1;
   *d = 98;

In (14), the process inserts instrumentation for every line to get the dynamic profile of the C program. For example, the code

   D1458 = *sig_out;
   D1459 = (int) D1458 ;

is instrumented with “printf” function insertion as follows:

   D1458 = *sig_out;
   printf("0\t");
   printf("0x%x\n", (unsigned int) D1458 );
   D1459 = (int) D1458 ;
   printf("1\t");
   printf("0x%x\n", (unsigned int) D1459 );
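As a non-limiting sketch, the instrumentation of (14) could be automated along the following lines. The helper name `instrument`, the regular expression, and the zero-based probe numbering are assumptions chosen to reproduce the listing above. The sketch further assumes that every input line is already in the one-operator form produced by (10).

```python
import re

# Hypothetical pattern for a simple assignment such as "D1459 = (int) D1458 ;".
# The [^=] guard keeps "==" comparisons from being treated as assignments.
ASSIGN = re.compile(r"^\s*(\*?\w+)\s*=[^=].*;\s*$")

def instrument(lines):
    """Return the lines with two printf probes after each assignment.

    The first probe prints the index of the instrumented line and the
    second prints the value that was just assigned.
    """
    out = []
    index = 0
    for line in lines:
        out.append(line)
        match = ASSIGN.match(line)
        if match is None:
            continue  # declarations, labels and gotos are left untouched
        lhs = match.group(1)
        out.append(f'printf("{index}\\t");')
        out.append(f'printf("0x%x\\n", (unsigned int) {lhs});')
        index += 1
    return out

for line in instrument(["D1458 = *sig_out;", "D1459 = (int) D1458 ;"]):
    print(line)
```

The log written by the instrumented program can then be read back line by line to recover the execution count and the value history of each line.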

FIG. 3 shows in more detail an exemplary parser used in FIG. 1. In one embodiment, the formatted C program from the pre-processor (2) is executed and the results of the execution are logged. The execution covers a variety of use case scenarios and is a thorough test suite. An incomplete test suite could lead to improper architecture definition. The formatted C program and the logged results are fed as input to the parser (3). In one implementation, the parser (3) performs the following operations:

In (20), the process builds a list of all variables used in the program (1). Each variable has associated properties that identify the variable uniquely. Some of the properties are information related to data type, whether the variable is an array or pointer.

In (22), the process builds a list of all operators used in the program. Each operator is also given a set of properties that describe the function of the operator. The properties are defined so that complex operators can be defined as a combination of the simple operators defined in the Basic Architecture (7).

In (24), the lines of executable code in the program are mapped into a data structure. All information about the lines is available in the data structure. This structure links into the variable list and operator list. Any line is uniquely identified by the variables and operators used in the line.

In (26), functions are identified and a list of functions is maintained.

In (28), the logged results from the execution of the C program are parsed and all relevant dynamic information is gathered. This is used to update the variable list and data structure of lines of code.

In (30), the C program could contain many lines that might be optimized by the compiler. For example, unnecessary assignments would be removed by the compiler. Such lines that could be potentially removed by the compiler are identified and marked as “dummy” lines. The algorithm to perform this is described in a separate section below.

In (32), the system optimizes any multiplication by a power of 2 to a left shift and any division by a power of 2 to a right shift (considering only positive powers). All lines that have multiplication or division operators by powers of 2 are replaced with left or right shifts, respectively. This ensures the correct identification of statistics associated with the execution of the program.
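One possible form of the check in (32) is sketched below, for illustration only. The function name `shift_for` and its return convention are hypothetical. The sketch also sets aside the fact that, in C, a right shift and a division can differ for negative signed operands.

```python
def shift_for(operator, constant):
    """Map '*' or '/' by a positive power of two to an equivalent shift.

    Returns (new_operator, shift_amount), or None when the rewrite does
    not apply. Only positive powers (constant >= 2) are considered.
    """
    if operator not in ("*", "/"):
        return None
    is_power_of_two = constant >= 2 and (constant & (constant - 1)) == 0
    if not is_power_of_two:
        return None
    amount = constant.bit_length() - 1  # log2 of the constant
    return ("<<" if operator == "*" else ">>", amount)

print(shift_for("*", 4))  # multiplication by 4
print(shift_for("/", 8))  # division by 8
print(shift_for("*", 6))  # not a power of two, so no rewrite
```

With this rewrite applied before statistics are gathered, such lines are counted against the shift operators rather than against MUL or DIV.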

In (34), the process optimizes redundant operations in loops. It is possible that some other form of optimization with regard to loops can be implemented by compilers. An algorithm is used to track such lines of code and duplicate the same optimization in the line data structure that has been created. This algorithm is also explained in a separate section below.

In (36), the lines of execution are now separated into two primary bins for architecture classification, namely data manipulation and address manipulation. This is a critical distinction since it drives some significant architecture decisions. Any program that has many address manipulation operations would benefit from a separate address manipulation unit, while such hardware would be overkill for other applications. The algorithm to do this is defined in a separate section below.

In (38), the process identifies required data memory bandwidth. One critical architecture decision pertains to the data memory bandwidth necessary to run the program. All lines that are dependent on data loaded from memory are marked under a different bin for this purpose. All lines that operate on data not loaded directly from memory (for example, a line could operate on the result of a line that operated on data loaded from memory) are marked separately. An algorithm is applied on this data to calculate the number of ports and the width of the ports to the data memory that would be required to facilitate the execution of the lines of the C program with minimal stall.

In (40), hardware flags are identified and processed. The native C program does not have a concept of hardware flags. Flags are typically coded as global variables. However, from a performance standpoint, it is imperative to identify all the hardware flags needed for the program. This is usually done by hand now. Either the C code is hand coded again in assembly to take advantage of hardware flags (and the resulting performance win) or the compiler is manually tweaked so that some type of coding structure with pragmas could be used to represent a flag. Neither option is easy or scalable. Our application has an algorithm that identifies potential hardware flags in the native C code and marks these global variables as flags. As part of the architecture definition, the hardware needed to represent these flags is also described and synthesized automatically. The algorithm to identify the flags is described in more detail below.

In (42), the process looks for expired variables that can be reused. In order to extract faithful parameters, it is important to consider the number of variables used for each line and the number of read and write ports available in the register file. In case of a mismatch, penalty cycles have to be added. However, a native C program is not written to optimize the number of variables used. So using the program as such could result in an unrealistic number of penalty cycles. So the data structure of lines has to be parsed and modified to minimize the use of new variables. This algorithm is described in a separate section below.

The dummy line identification is detailed next. The dummy assignment check identifies lines in the C program that contain assignments that are likely to be optimized out by the compiler. In one embodiment, the process includes code to perform the following:

1) March through the data structure of lines looking for lines with an assignment statement.

2) When an assignment statement is hit, the left hand side variable and right hand side variables are marked.

3) The lines further down are investigated to verify if this assignment is necessary or if the right hand side variable could have been directly used instead. Any reassignment of the right hand side variable in any line prior to the last line where the left hand side variable is referred to would directly break this requirement.

4) However, there could be other cases (such as in conditional checks) where even if the requirement in 3 is fulfilled, the assignment is still necessary. A logic that understands the branch conditions and branch depth is used to make this decision.

5) If after all these checks, it is identified that the assignment need not have been made, the line containing the assignment operation is marked as a dummy line.

Next, dummy variable reassignment is discussed. Once an assignment line is marked as dummy, the variable assignment in that line has become redundant. So the following pseudo-code reassigns variables in order to ensure consistency of the data structure of lines:

1) Identify the right hand side variable of the dummy assignment line.

2) Identify the previous line in which this variable occurred on the left hand side.

3) Replace that left hand side variable with the left hand side variable of the dummy line. A check of branch depth has already been performed in the dummy assignment check section. So this replacement will be consistent.

Next, loop optimization is discussed. The process tracks possible optimizations that compilers are likely to perform on accessing and indexing of arrays within loops. An example is given below.

Consider the line of C code (base is an int*)

b=*(base+i);

When this line of C code is sent through the preprocessor, the following lines may be generated:

   temp1 = i*4;
   temp2 = base + temp1;
   b = *temp2;

When this operation is performed inside a tight loop, the first line is redundant. In any loop, the next address can be easily calculated by adding 4 to the previous address. Such lines in the program are tracked and marked as dummy. If this is not done, the architecture definition would be unfairly skewed to account for operations that would never exist when the code is actually compiled for the machine. So it is essential to be able to identify all such redundant operations before proceeding to the architecture definition stage. Pseudo-code for loop optimization is as follows:

1) Track all variable values through iterations of execution. This information is available from the parsing of the logged results of the C program due to the instrumentation inserted in (14).

2) Compare the current value of the variable to the previous value of the variable and store this as the difference value.

3) If the difference value remains unchanged through all iterations of execution and the line of code actually happens to be within a loop (loops have been already identified), then this is a candidate for the optimization. If the operator on the line happens to be a multiply, then this line is marked as dummy.
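The three steps above can be sketched, in a hedged and simplified form, as follows. The function name `is_loop_dummy` and its arguments are hypothetical. Requiring at least three logged values, so that two differences can be compared, is an assumption of the sketch and is not stated in the description.

```python
def is_loop_dummy(values, operator, in_loop):
    """Decide whether a line is a redundant in-loop multiply.

    values   -- successive values logged for the line's left hand side
    operator -- operator used on the line, e.g. "*"
    in_loop  -- True if the line was already identified as inside a loop
    """
    if not in_loop or operator != "*" or len(values) < 3:
        return False
    # Step 2: difference between each value and the previous one.
    diffs = [cur - prev for prev, cur in zip(values, values[1:])]
    # Step 3: the difference must stay unchanged across all iterations.
    return all(d == diffs[0] for d in diffs)

# temp1 = i*4 logged over four iterations with i = 0, 1, 2, 3
print(is_loop_dummy([0, 4, 8, 12], "*", True))
```

For the `temp1 = i*4` example, the logged values grow by a constant 4, so the multiply is marked as dummy and the address is assumed to be formed by repeated addition.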

Next, an identification of data and address manipulations is discussed. The process splits the lines of execution in the C program into data and address manipulation bins. While there is no difference between these operations from a programming point of view (they are just arithmetic operations on variables), there is significant difference from a processor architecture point of view. A data manipulation operation is likely to rely on a previous address manipulation operation to fetch the data from memory. For a variety of reasons this differentiation is very important. To identify address and data manipulation operations, the system performs the following pseudo-code:

1) Walk through the lines of code identifying all lines that operate on pointers (declared either as pointers or arrays in the C program).

2) Mark lines as address manipulation operations.

3) Ensure results are only for fetching data from memory.

Operation 3 is done as the marked lines could either lead to or could be dependent on other lines of code. These lines are tracked to ensure that the results of these lines are meant only for the purpose of fetching data from the memory. To implement this, the variables involved in these lines are tracked and the process makes sure that these variable values (without another independent reassignment) are not used for any other purpose. If this is the case, these lines are also marked as address manipulation operations. Any line that supplies data to both address and data manipulation operations is classified as a data manipulation operation.

Flag identification is discussed next. The flag detection algorithm walks through all the global variables declared in the C program using the following pseudo-code:

1) Each global variable is checked for the possible values taken during the course of execution.

2) If the only values taken are 0 and 1, then proceed to the next step.

3) Check all lines where the values are set. The values can be set only through an immediate operation (in other words an explicit assignment such as x=1) to proceed to the next stage. If it is derived as a result of assignment of another variable (such as x=y), the right hand side variable is back tracked to see if that variable conforms to this rule. There is a logic in place to prevent a perpetual lock up situation. If there is any other operation (such as arithmetic, logical or memory fetch), the variable cannot be a flag.

4) Check actual assignment lines. The assignment can only be to one of the values (0 or 1) in the general flow. Assignment to the other value can happen only within a conditional check flow.

5) Mark a variable that fits the above rules as a Flag.

6) A hardware flag corresponding to this definition is specified in the processor architecture and will be automatically synthesized.

7) All lines that set the values of this variable are marked as flag manipulation lines.

8) The architecture definition also creates instructions that enable these operations.

9) All these lines that are marked as flag manipulation lines would use these newly defined instructions rather than standard instructions so that they refer to specific hardware flags.
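For illustration only, rules 1 through 3 above might be sketched as shown below. The function name `is_flag`, the two dictionaries, and the assignment kinds "immediate", "copy" and "other" are hypothetical conventions. Rule 4 (general flow versus conditional flow) is left out of the sketch, and the set of variables currently being visited stands in for the logic that prevents a perpetual lock up.

```python
def is_flag(name, observed, assignments, _visiting=None):
    """Apply rules 1-3 to one global variable.

    observed    -- dict: variable name -> set of values seen at run time
    assignments -- dict: variable name -> list of (kind, source) tuples,
                   where kind is "immediate", "copy" or "other" and
                   source is the right hand side variable of a copy
    """
    visiting = set() if _visiting is None else _visiting
    if name in visiting:
        return True  # already being checked: break the cycle of copies
    visiting.add(name)
    # Rules 1 and 2: values seen during execution are limited to 0 and 1.
    if not observed.get(name, set()) <= {0, 1}:
        return False
    # Rule 3: only immediate assignments, or copies of conforming variables.
    for kind, source in assignments.get(name, []):
        if kind == "immediate":
            continue
        if kind == "copy" and is_flag(source, observed, assignments, visiting):
            continue
        return False  # arithmetic, logical or memory fetch
    return True

observed = {"done": {0, 1}, "src": {0, 1}, "count": {0, 1, 2}}
assignments = {
    "done": [("immediate", None), ("copy", "src")],
    "src": [("immediate", None)],
    "count": [("immediate", None)],
}
print([name for name in observed if is_flag(name, observed, assignments)])
```

In this hypothetical data, `done` and `src` qualify while `count` is rejected because it takes the value 2 during execution.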

Next, variable reuse is discussed. The process to minimize the number of variables used in each line is as follows:

1) At each line (that has not been marked dummy), mark the left hand side variable and right hand side variables.

2) If any of the right hand side variables are not referred to in any of the subsequent lines, then that variable is used to replace the existing left hand side variable.

3) All lines that refer to the left hand side variable in the lines below are changed to refer to the variable that has replaced it.

4) In all of the process above, the algorithm limits the search scope to the zone where the left hand side variable is not re-assigned.
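A minimal sketch of this renaming follows, under several stated assumptions. Each line is reduced to a hypothetical `(lhs, rhs_list)` tuple holding variable names only. The `keep` argument, which protects variables whose names must survive (such as program outputs), is an addition of the sketch and not part of the description. The expiry test of step 2 conservatively looks at all subsequent lines, while the renaming of step 3 stops at the next re-assignment as in step 4.

```python
def reuse_expired(lines, keep=frozenset()):
    """Rename left hand sides to expired right hand side variables.

    lines -- list of (lhs, rhs_list) tuples, one per non-dummy line
    keep  -- left hand side names that must not be renamed
    Returns a new list; the input is not modified.
    """
    lines = [(lhs, list(rhs)) for lhs, rhs in lines]
    for i, (lhs, rhs) in enumerate(lines):
        if lhs in keep:
            continue
        later = lines[i + 1:]
        # Step 2: a right hand side variable never mentioned again.
        expired = next(
            (v for v in rhs
             if v != lhs and all(v != l and v not in r for l, r in later)),
            None,
        )
        if expired is None:
            continue
        lines[i] = (expired, rhs)
        # Steps 3 and 4: rename later uses until the old name is re-assigned.
        for j in range(i + 1, len(lines)):
            l, r = lines[j]
            lines[j] = (l, [expired if v == lhs else v for v in r])
            if l == lhs:
                break
    return lines

program = [("D1182", ["a", "b"]), ("D1183", ["D1182", "c"]), ("d", ["D1183", "d"])]
for lhs, rhs in reuse_expired(program, keep={"d"}):
    print(lhs, "<-", rhs)
```

Applied to the three-line example from (10), both temporaries are folded into `a`, so the sequence touches four distinct variables instead of six.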

Once the parser phase is completed, the data structure of lines is revisited by a parameter extraction module or parameter extractor (4).

FIG. 4 shows in more detail the exemplary parameter extraction module (4). A variety of relevant parameters are extracted by walking through this data structure. For example, the total cycles needed to execute the program for the given test case is calculated. This is performed by noting the number of clock cycles needed to execute any given line (value derived from the property of the operator used in that line) and adding any clock cycle penalty suffered by the line due to data dependency or other reasons and multiplying this value by the number of times the line was executed (value known by parsing the logged results of the execution of the C program). This operation is repeated for every line and the total clock cycles needed for the execution of the entire test case is arrived at. Similarly, the number of clock cycles associated with address calculation, memory load, memory store, conditional branches, loops and data manipulation are arrived at. A list is built that tags the operator distribution for all of the above calculations. For example, in case of the total cycles, how many of these are associated with which operator is calculated. Additionally, for each operator, the distribution of usage across different data widths is also calculated. At the end of this process, a table is generated, such as the exemplary table below:

Data Type       8 bit  16 bit  32 bit  64 bit
IF              0  1227671  1985699  2251196
CASE            0  0  0  0
ADD             4  1584710  1101631  906162
SUB             0  598708  400  185602
MUL             0  14523030  0
DIV             0  0  0  0
LEFT_SHIFT      0  108314  32931  1301594
RIGHT_SHIFT     0  823510504  193194
MOVE_IMMEDIATE  0  45878  3230  15127
FUNCTION_CALL   0  7927413207890  0
MOVE            0  100092  1720  212810
MODULO          0  0  0  0
AND             0  7398  640  183780
OR              0  200  0  74822
NOT             0  3287  10504  7218
XOR             0  210  0  1705735
CHECK           0  40281544  7944
FLAG_MAN        0  1  337065  0
GOTO            0  0  0  0
MEM_LOAD        4  1890207  935459760
MEM_STORE       0  215484  4400  1600
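The cycle accounting described above can be sketched as follows, by way of example only. The dictionary keys `op`, `op_cycles`, `penalty` and `count`, the function names, and the sample numbers are all hypothetical and are not taken from the table.

```python
from collections import defaultdict

def line_cycles(line):
    """(operator cycles + penalty cycles) multiplied by execution count."""
    return (line["op_cycles"] + line["penalty"]) * line["count"]

def total_cycles(lines):
    """Total clock cycles for the whole test case."""
    return sum(line_cycles(line) for line in lines)

def cycles_by_operator(lines):
    """Break the same total down per operator."""
    table = defaultdict(int)
    for line in lines:
        table[line["op"]] += line_cycles(line)
    return dict(table)

# Hypothetical profile of three lines.
profile = [
    {"op": "ADD", "op_cycles": 1, "penalty": 0, "count": 100},
    {"op": "MUL", "op_cycles": 3, "penalty": 1, "count": 10},
    {"op": "ADD", "op_cycles": 1, "penalty": 1, "count": 5},
]
print(total_cycles(profile))
print(cycles_by_operator(profile))
```

The same accumulation could be keyed on a bin (address calculation, memory load, and so on) or on data width to obtain the other distributions mentioned above.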

Similarly, statistics about function calls are also built. For each function call, the number of times it is called and the clock cycles spent on executing the function are calculated. The lines of code are then sorted in the descending order of execution count.

The definition of the architecture that best suits the needs of the C program is given as input of an iterative process. The first time parameter extraction is run, the process calculates statistics using the base architecture defined in (7). In addition to the output of parameter extraction, real time constraints associated with the C program and the hardware constraints associated with the product are also fed as input to the architecture definition block. The block automatically generates an architecture that would meet the performance requirement specified. This architecture can then be refined using an architecture optimizer to arrive at the optimal architecture. The first step in this stage is to meet the real time constraints. The goal of this step is to define new instructions and corresponding processor architecture that would reduce the total execution time and enable the processor to meet the real time constraints. In one exemplary implementation, the following operations are performed:

1. The list of lines sorted by execution count is available from the previous section. March along these lines and identify groups of lines that occur in sequence. For example the sorted list may look like 651, 652, 659, 802, 803, . . . . In this case lines 651 through 659 are identified for the first group. In marking lines as part of a single group, they should have the same execution count. In the example, it is important that 651, 652 and 659 have the same execution count (the number of times these lines were counted in the parsing of the logged results of the C program execution). Then all lines between the first and last line of this list are marked as a group.

2. These groups of lines have a high execution count and therefore consume a substantial amount of execution time. If these lines could be amortized into a single instruction, it might result in reducing the execution time. A new instruction is created by molding these instructions into one. Any such candidate for molding is called an Instruction Molding Candidate (IMC).

3. Each group of instructions is checked for molding violations. For example an unconditional GOTO (Jump) or a function call in the middle of this group would invalidate the sequence. There are other constraints such as a group of lines needing more data variables than accessible from the register file. For example, if the current architecture assumes a 2 read port register file and the group of instructions needs three variables to form a new instruction, it is not possible to form an IMC using this group. If there is only one write port to the register file and the group of instructions writes out two variables, it is again not possible to form an IMC from this group. So the system checks for hardware related architecture constraints and locates a sub-group (if it exists) that conforms to these conditions.

4. Within this sub-group, it is possible to form multiple IMCs. For example, we could have a sequence of lines as listed below forming a sub-group.

a=b+c;

d=a*2;

e=d AND b;

In this case, we have one IMC that consists of all three operations, another that contains only the first two instructions, and another that contains the last two instructions. All such possible IMCs are formed.

5. The group of lines from which the IMC has been defined is one place in the program where these instructions occur in sequence. It is possible that such sequences could exist in other places in the C program. This is now investigated. It is important to note that the condition to be met is not only the sequence of instructions, but also the relative operator dependency. Take the example described in point 4. If we come across another place in the code where we find the sequence

x=y+z;

r=x*2;

f=r AND x;

then this could not be counted as a place where the IMC can be used: here the third operation reads x, the result of the first operation, whereas the IMC reads b, an input of the first operation. So the algorithm not only checks for the same sequence of instructions, but also the same variable dependency structure. All such places where the IMC can possibly be used are tagged along with the IMC.

6. The result of point 5 is used to calculate the potential reduction in the execution cycle count of the program if this IMC were to be actually used as a new instruction.

7. At this point, the algorithm queries a hardware synthesizer block to get the timing and area for implementing this IMC as an instruction.

8. The process is repeated for all lines.

9. Some of these IMCs potentially replace full functions. In that case, the program flow would change significantly when these IMCs are implemented as instructions. So they are marked as special IMCs.

10. At this point the list of all possible IMCs for the current stage has been arrived at. An optimization cost function is used to pick the IMCs that need to be implemented as instructions. The algorithm is not tied to a specific cost function, although the likely cost functions would be ones that consider the timing of the new instruction and how it impacts the execution time of the C program. Calculating the impact on execution time is a non-trivial task. Any IMC that has a timing less than the current clock cycle time does not impact the architecture significantly. However, we are likely to encounter IMCs that have a timing greater than the current clock cycle time. If these IMCs are accepted, the clock cycle time would increase for all instructions and could possibly increase the execution time even though the number of clock cycles could be lower. Hence this calculation and decision is non-trivial.

11. In order to perform this calculation, IMCs are grouped together into dependent groups. To perform this grouping, the principle of complete statistical independence is applied: any IMCs that are not completely statistically independent are grouped together. This is a rather conservative approach, but one that is needed nevertheless. Grouping helps in preventing double counting. Whenever a decision regarding increasing the cycle time has to be made, all the IMC groups are investigated to find IMCs that might benefit from this increase in cycle time (i.e., they also have a timing which is greater than the current cycle time but lower than the new cycle time). The best IMC (the one that reduces the cycle count the most) is picked from each group that stands to benefit. Using this information the new execution time is calculated, and if this is less than the current execution time, the cycle time is increased.

12. Each time the cost function identifies an IMC, the corresponding instruction is defined and the architecture definition is altered to accommodate this new instruction. The effect of this new instruction on other IMCs is investigated and the IMCs are rationalized.

13. Once the cost function cannot find any IMCs that fit the requirement, the process is halted.

14. The architecture is passed as input to the parser. Parameters are extracted again and the architecture definition is revisited. This loop runs until the cost function cannot find any new IMCs that fit the requirement.

15. If the real time constraints are still not met, some other architecture variables are considered. Examples are the number of read and write ports to the register file, the number of read and write ports to memory, the width of these ports, the scalarity of the processor, and the instruction grouping rules to optimally use the hardware. These are variables that are not strictly related to instructions, but are essential to finding new IMCs that could further help in reducing the execution time. The loop is again repeated. As an example, let us consider scalarity. The algorithm marches through the data structure of lines and identifies the amount of instruction level parallelism inherent in the code. This is used to optimize the hardware resources available for the execute unit of the processor and to define the scalarity of the processor. The available hardware for the execute unit is also used to define the instruction grouping rules. It is important to note that the grouping rules arrived at in this case are optimal for the code presented and not arbitrarily chosen. A cost function to minimize idle slots is used to define this.

16. If the real time constraints are still not met by increasing any of the above mentioned architecture variables, then the algorithm identifies macro parallelism and optimizes the number of cores necessary for the identified parallelism. The algorithm also splits the instruction stream to be executed in each of the cores.

17. At any point of time, if the real time performance constraints are met, the algorithm exits the loop.

18. If the performance constraints are not met even after all variables have been looked at, the algorithm identifies the application as not fit for a programmable solution and recommends a co-processor architecture where some of the functionality is implemented as dedicated data path hardware. The list of IMCs synthesized and IMCs rejected (since the timing was greater than the current cycle time) is used to define this co-processor architecture.
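Steps 4 and 5 above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function names (`candidate_imcs`, `dependency_signature`, `find_uses`), the set `OPERATOR_WORDS`, and the idea of renaming variables in order of first appearance are assumptions made only for illustration. The sketch treats each program line as a string and ignores the other molding checks, such as register file port limits.

```python
import re

# Hypothetical: operator keywords that must not be treated as variable names.
OPERATOR_WORDS = {"AND", "OR", "XOR", "NOT"}

def candidate_imcs(lines):
    """Step 4: every contiguous run of two or more lines is a candidate."""
    n = len(lines)
    return [tuple(lines[i:j]) for i in range(n) for j in range(i + 2, n + 1)]

def dependency_signature(statements):
    """Canonical form of a sequence: variables become v0, v1, ... in order
    of first appearance, so two sequences share a signature only if they use
    the same operators in the same order and wire operands the same way."""
    names = {}
    signature = []
    for stmt in statements:
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", stmt)
        canon = []
        for tok in tokens:
            is_name = re.fullmatch(r"[A-Za-z_]\w*", tok) is not None
            if is_name and tok not in OPERATOR_WORDS:
                names.setdefault(tok, f"v{len(names)}")
                canon.append(names[tok])
            else:
                canon.append(tok)  # operator or literal, kept as written
        signature.append(" ".join(canon))
    return tuple(signature)

def find_uses(imc, program):
    """Step 5: start indices in `program` whose window matches the IMC in
    both instruction sequence and variable dependency structure."""
    target = dependency_signature(imc)
    k = len(imc)
    return [i for i in range(len(program) - k + 1)
            if dependency_signature(program[i:i + k]) == target]

imc = ["a=b+c", "d=a*2", "e=d AND b"]
program = imc + ["x=y+z", "r=x*2", "f=r AND x"] + ["p=q+s", "t=p*2", "u=t AND q"]
uses = find_uses(imc, program)
```

With the lines from the text, the three-line sub-group yields three candidates, and the sequence ending in `f=r AND x` is rejected because its last operation reads the result of the first operation rather than one of its inputs. The renamed copy at the end of `program` is accepted.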

Another algorithm tracks the marching of pointers and builds statistics regarding the stride and memory access patterns. These statistics, in addition to the information attained about memory dependency, are used to optimize the cache pre-fetch mechanism and cache policy.
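As a rough illustration of stride statistics, the following Python sketch summarizes the address trace of a single pointer. The function names, the returned fields, and the 0.75 regularity threshold are hypothetical choices for this example; the text does not specify how the statistics are computed or how they drive the pre-fetch policy.

```python
from collections import Counter

def stride_statistics(addresses):
    """Summarize the strides between consecutive addresses of one pointer."""
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    if not strides:
        return {"dominant_stride": None, "regularity": 0.0}
    stride, hits = Counter(strides).most_common(1)[0]
    # regularity: fraction of accesses that follow the dominant stride
    return {"dominant_stride": stride, "regularity": hits / len(strides)}

def suggest_prefetch(addresses, threshold=0.75):
    """Return the next address worth pre-fetching, or None if the access
    pattern is too irregular. The threshold is an assumed tuning knob."""
    stats = stride_statistics(addresses)
    if stats["dominant_stride"] is None or stats["regularity"] < threshold:
        return None
    return addresses[-1] + stats["dominant_stride"]

trace = [100, 104, 108, 112, 200, 204]
stats = stride_statistics(trace)
next_address = suggest_prefetch(trace)
```

For this trace, four of the five strides equal 4, so the sketch reports a dominant stride of 4 and proposes address 208 as the next pre-fetch target.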

Once the real time performance constraints have been met, other hardware constraints are visited. The hardware constraints can be represented in terms of area, power and some other parameters. The algorithm then fine tunes the architecture to reduce redundant paths and non-critical sections to meet these constraints. Another algorithm is employed to check all the instructions available and verify the benefit provided by these instructions. A cost function is used to perform this check. All instructions that can be safely removed without impacting the real time performance constraints are removed from the set so that instruction decoding time is reduced to the minimal level. These constraints may or may not be met. They are used so that the architecture defined is not excessive for a given application and any scope for reducing the complexity of the architecture is investigated.
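The instruction pruning described above could look like the following Python sketch. It assumes a deliberately simple cost model that is not stated in the source: each custom instruction saves a fixed, independent number of cycles, and removing it only adds those cycles back. The function name, parameter names, and instruction names are hypothetical.

```python
def prune_instructions(base_cycles, saved_cycles, cycle_time, deadline):
    """Greedily drop the least useful instructions while the program still
    meets its real time deadline.

    base_cycles  -- cycle count with no custom instruction at all
    saved_cycles -- dict: instruction name -> cycles it saves (assumed additive)
    cycle_time   -- clock period, in the same time unit as `deadline`
    """
    kept = dict(saved_cycles)
    # Try the instructions with the smallest benefit first.
    for name in sorted(saved_cycles, key=saved_cycles.get):
        trial = {k: v for k, v in kept.items() if k != name}
        exec_time = (base_cycles - sum(trial.values())) * cycle_time
        if exec_time <= deadline:
            kept = trial  # safe to remove: the deadline is still met
    return sorted(kept)

remaining = prune_instructions(
    base_cycles=1000,
    saved_cycles={"mac": 300, "sat_add": 50, "bitrev": 10},
    cycle_time=2,      # e.g. nanoseconds
    deadline=1400,
)
```

In this example only `mac` is needed to meet the deadline, so the two low-benefit instructions are removed from the set.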

FIG. 5 shows an exemplary system to automatically generate an architecture definition. In this process, output from parser (3) is provided to the parameter extraction module (4) as previously discussed. Next, the process forms a group with a set of program lines based on predetermined rules (60). Next, a set of molding rules is retrieved (61). The process checks for molding rule violations and splits the program lines into sub-groups (62). The process finds the IMCs (63) and identifies places for IMC usage (64). Next, the process determines cycles associated with each IMC (65). Timing and area determinations are also performed for the IMCs (66). The information is fed back to operations 60 and 63 and also provided to identify IMCs that can replace full functions (67). Next, the IMCs are grouped based on statistical dependence (68). The process uses cost functions to pick the best IMC (69) and implements a new instruction for the best IMC (70). An iterative determination of the effects of the new instruction on other IMCs is done (71), and the determination is provided to operation 69 to pick the best IMC and operation 70 to implement new instructions. This is done until a threshold is reached and the new instruction is added to the architecture definition (5). The process checks the impact of the new instruction on other architecture variables (72) and accepts or rejects the new instruction. The process is then repeated until a predetermined threshold is reached that meets the constraints placed on the custom IC.
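The grouping and selection in operations 68 and 69, together with the cycle time decision described in steps 10 and 11, can be sketched in Python as follows. The `IMC` record, the function name, and the numbers are hypothetical. The sketch also assumes that an IMC fits whenever its timing does not exceed the proposed cycle time, and that savings from different groups simply add up.

```python
from typing import NamedTuple

class IMC(NamedTuple):
    name: str
    timing: float      # delay of the molded instruction
    cycles_saved: int  # reduction in total cycle count if it is used

def evaluate_cycle_time(cycles, cycle_time, new_cycle_time, groups):
    """Decide whether raising the clock period pays off.

    groups -- lists of IMCs that are not statistically independent; only
              the best IMC per group is counted, to avoid double counting.
    Returns (accept, new_execution_time, picked_names).
    """
    picked = []
    for group in groups:
        # IMCs that only become usable at the longer cycle time
        eligible = [c for c in group if cycle_time < c.timing <= new_cycle_time]
        if eligible:
            picked.append(max(eligible, key=lambda c: c.cycles_saved))
    new_cycles = cycles - sum(c.cycles_saved for c in picked)
    new_time = new_cycles * new_cycle_time
    accept = new_time < cycles * cycle_time
    return accept, new_time, [c.name for c in picked]

groups = [
    [IMC("A", 11, 200), IMC("B", 12, 150)],  # dependent: count only the best
    [IMC("C", 15, 500)],                     # too slow even for the new clock
    [IMC("D", 11, 50)],
]
decision = evaluate_cycle_time(cycles=1000, cycle_time=10,
                               new_cycle_time=12, groups=groups)
```

Here the longer cycle time is accepted because 750 cycles at a period of 12 take less time than 1000 cycles at a period of 10. With smaller savings, the same call would reject the increase.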

FIG. 6 shows an exemplary system to automatically generate a custom IC. The system of FIG. 6 supports an automatic generation of an architecture for a custom hardware solution for the chosen target application. The target application specification is usually done through an algorithm expressed as computer readable code in a high-level language like C, Matlab, SystemC, Fortran, Ada, or any other language. The specification includes the description of the target application and also one or more constraints such as the desired cost, area, power, speed, performance and other attributes of the hardware solution.

In FIG. 6, an IC customer generates a product specification 102. Typically there is an initial product specification that captures all the main functionality of a desired product. From the product, algorithm experts identify the computer readable code or algorithms that are needed for the product. Some of these algorithms might be available as IP from third parties or from standard development committees. Some of them have to be developed as part of the product development. In this manner, the product specification 102 is further detailed in a computer readable code or algorithm 104 that can be expressed as a program such as a C program or a math model such as a Matlab model, among others. The product specification 102 also contains requirements 106 such as cost, area, power, process type, library, and memory type, among others.

The computer readable code or algorithm 104 and requirements 106 are provided to an automated IC generator 110. Based only on the code or algorithm 104 and the constraints placed on the chip design, the IC generator 110 automatically generates, with little or no human involvement, an output that includes a GDS file 112, firmware 114 to run the IC, a software development kit (SDK) 116, and/or a test suite 118. The GDS file 112 and firmware 114 are used to fabricate a custom chip 120.

The instant system alleviates the issues of chip design and makes it a simple process. The system shifts the focus of the product development process from the hardware implementation process back to product specification and algorithm design. Instead of being tied down to specific hardware choices, the algorithm can always be implemented on a processor that is optimized specifically for that application. The system generates this optimized processor automatically along with all the associated software tools and firmware applications. This whole process can be done in a matter of days instead of the years that it takes now. In a nutshell, the system makes the digital chip design portion of the product development into a black box.

In one embodiment, the instant system product can take as input the following:

Computer readable code or algorithm defined in C/Matlab

Peripherals required

Area Target

Power Target

Margin Target (how much overhead to build in for future firmware updates and increases in complexity)

Process Choice

Standard Cell library Choice

Testability scan

The output of the system may be a digital hard macro along with all the associated firmware. A software development kit (SDK) optimized for the digital hard macro is also automatically generated so that future upgrades to firmware are implemented without having to change the processor.

The system performs automatic generation of the complete and optimal hardware solution for any chosen target application. While the common target applications are in the embedded applications space, they are not necessarily restricted to that.

By way of example, a computer to support the automated chip design system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).

Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

The invention has been described herein in considerable detail in order to comply with the patent statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.

What is claimed is:
 1. A method to automatically generate a processor architecture for a custom integrated circuit (IC) described by a computer readable code, the IC having at least one or more timing and hardware constraints, comprising: a. extracting parameters defining the processor architecture from a static profile and a dynamic profile of the computer readable code; b. iteratively optimizing the processor architecture by changing one or more parameters until all timing and hardware constraints expressed as a cost function are met and using a compiler to compile, assemble and link code for each processor architecture iteration to arrive at a customized architecture; and c. synthesizing the generated processor architecture into a computer readable description of the custom integrated circuit for semiconductor fabrication.
 2. The method of claim 1, comprising optimizing processor scalarity and instruction grouping rules.
 3. The method of claim 1, comprising optimizing the number of processor cores needed and automatically splitting an instruction stream to use the processor cores effectively.
 4. The method of claim 1, wherein the processor architecture optimization comprises changing an instruction set, including reducing the number of instructions required and encoding the instructions to improve instruction access and decode speed, and to improve instruction memory size requirement.
 5. The method of claim 1, wherein the processor architecture optimization comprises changing one of: a register file port, port width, and number of ports to data memory.
 6. The method of claim 1, wherein the processor architecture optimization comprises changing one of: data memory size, data cache pre-fetch policy, data cache policy, instruction memory size, instruction cache pre-fetch policy, and instruction cache policy.
 7. The method of claim 1, wherein the processor architecture optimization comprises adding a co-processor.
 8. The method of claim 1, comprising pre-processing the computer readable code by: a. determining a memory location for each pointer variable; and b. inserting an instrumentation for each line.
 9. The method of claim 1, comprising changing the processor instruction set by automatically generating new instructions uniquely customized to the computer readable code to improve performance of the processor architecture, further including: a. removing dummy assignments; b. removing redundant loop operations; c. identifying required memory bandwidth; d. replacing one or more software implemented flags as one or more hardware flags; and e. reusing expired variables.
 10. The method of claim 1, wherein extracting parameters further comprises: a. determining an execution cycle time for each line; b. determining an execution clock cycle count for each line; c. determining clock cycle count for one or more bins; d. generating an operator statistic table; e. generating statistics for each function; and f. sorting lines by descending order of execution count.
 11. The method of claim 1, comprising molding commonly used instructions into one or more groups and generating a custom instruction for each group to improve performance (instruction molding).
 12. The method of claim 11, comprising checking for a molding violation in the new instruction candidate.
 13. The method of claim 11, comprising applying a cost function to select an instruction molding candidate (IMC).
 14. The method of claim 11, comprising grouping instruction molding candidates (IMCs) based on statistical dependence.
 15. The method of claim 1, comprising determining timing and area costs for the architecture parameter change.
 16. The method of claim 1, comprising identifying sequences in the program to be replaced by one or more instruction molding candidates (IMCs) and rearranging instructions within a sequence to maximize IMC usage while retaining code functionality.
 17. The method of claim 1, comprising passing information regarding candidate code to use a newly synthesized instruction to a compiler.
 18. The method of claim 1, comprising tracking pointer marching and building statistics regarding stride and memory access patterns and memory dependency to optimize cache pre-fetching and a cache policy.
 19. A system to automatically generate a custom integrated circuit (IC) described by a computer readable code or model, the IC having at least a floating point parameter, a performance constraint, and a static range and a dynamic range for an input signal, comprising: a. means for extracting parameters defining the processor architecture from a static profile and a dynamic profile of the computer readable code; b. means for iteratively optimizing the processor architecture by changing one or more parameters to meet all timing and hardware constraints; and c. means for synthesizing the generated processor architecture into a computer readable description of the custom integrated circuit for semiconductor fabrication.
 20. The system of claim 19, comprising a. means for molding commonly used instructions into one or more groups and generating a custom instruction for each group to improve performance (instruction molding); b. means for checking for a molding violation in the new instruction candidate; c. means for applying a cost function to select an instruction molding candidate (IMC) and means for grouping IMCs based on statistical dependence.